Ordinal Logistic Regression

Advanced Categorical Data Analysis

Author

Dr Muhammad Saufi

Published

July 18, 2024

This document serves both learning and practical purposes. It is designed for educational use, aiming to enhance statistical analysis skills and provide clear, organized notes for future reference.

1 Introduction

Ordinal logistic regression is a statistical analysis used for predicting an ordinal outcome based on one or more predictor variables. An ordinal outcome has categories with a natural order but unknown distances between categories, such as ratings (e.g., poor, fair, good, excellent).

The dependent variable is ordinal, meaning it has a meaningful order. Examples include survey responses (e.g., strongly disagree to strongly agree) or educational attainment (e.g., high school, bachelor’s, master’s, doctorate).

Ordinal logistic regression includes several models that are used to analyze ordinal data. The most common models are:

  1. Adjacent-Category Logit Model
  2. Continuation-Ratio Logit Model
  3. Proportional Odds Model (Cumulative Logit Model)

1.1 Adjacent-Category Logit Model

This model compares each response category with the next adjacent category. The model can be written as:

\[ \log \left( \frac{P(Y = j)}{P(Y = j+1)} \right) = \alpha_j + \beta X \]

  • \(Y\) is the ordinal response variable.
  • \(j\) is the category.
  • \(\alpha_j\) are the intercepts for the \(j\)-th category.
  • \(\beta\) is the coefficient for the predictor variable \(X\).

The model assumes that the effect \(\beta\) is the same for every pair of adjacent categories, so the log-odds ratio associated with \(X\) does not depend on which adjacent pair is compared. The model helps understand how the odds change from one category to the next adjacent category based on the predictor variables.
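
To make the mapping from adjacent-category logits back to category probabilities concrete, here is a small language-agnostic sketch in Python (the numbers are illustrative, not from any fitted model):

```python
import math

def adjacent_category_probs(alphas, beta, x):
    """Recover P(Y = j) from adjacent-category logits
    log(P(Y = j) / P(Y = j + 1)) = alphas[j] + beta * x."""
    logits = [a + beta * x for a in alphas]
    # Relative to the last category: P(j)/P(J) = exp(sum of logits from j to J-1)
    rel = [math.exp(sum(logits[j:])) for j in range(len(logits) + 1)]
    total = sum(rel)
    return [r / total for r in rel]

probs = adjacent_category_probs(alphas=[-1.0, 0.5], beta=0.8, x=1.0)
# probs sum to 1, and log(probs[j] / probs[j + 1]) recovers each adjacent logit
```

The key point the sketch shows: the J - 1 adjacent-pair logits, together with the constraint that probabilities sum to one, fully determine the J category probabilities.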

1.2 Continuation-Ratio Logit Model

This model is used to compare each response category with all lower categories combined. The model can be written as:

\[ \log \left( \frac{P(Y = j)}{P(Y < j)} \right) = \alpha_j + \beta X \]

  • \(Y\) is the ordinal response variable.
  • \(j\) is the category.
  • \(\alpha_j\) are the intercepts for the \(j\)-th category.
  • \(\beta\) is the coefficient for the predictor variable \(X\).

Each comparison has its own intercept \(\alpha_j\), and the model can be extended to allow category-specific slopes \(\beta_j\), making it more flexible than the proportional odds model. The model compares the probability of being in a particular category to the probability of being in all lower categories combined, providing detailed insights into transitions between levels.
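
With the same hedge (illustrative numbers only), the continuation-ratio logits above determine the category probabilities sequentially, because \(P(Y \leq J)\) must equal 1:

```python
import math

def continuation_ratio_probs(alphas, beta, x):
    """Category probabilities from log(P(Y = j) / P(Y < j)) = alphas[j] + beta * x,
    defined for j = 2..J (the ratio is undefined for the lowest category)."""
    ratios = [math.exp(a + beta * x) for a in alphas]  # r_j = P(Y = j) / P(Y < j)
    # P(Y <= J) = P(Y = 1) * prod(1 + r_j) must equal 1, which pins down P(Y = 1)
    p1 = 1.0
    for r in ratios:
        p1 /= 1.0 + r
    probs, cum = [p1], p1
    for r in ratios:
        probs.append(r * cum)   # P(Y = j) = r_j * P(Y < j)
        cum *= 1.0 + r          # update the running P(Y <= j)
    return probs

probs = continuation_ratio_probs(alphas=[-0.5, 0.2], beta=0.3, x=1.0)
```

Each ratio compares one category against everything below it, which is exactly the "transitions between levels" reading described above.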

1.3 Proportional Odds Model (Cumulative Logit Model)

The Proportional Odds Model is used to model the cumulative probabilities of the ordinal outcome variable. The model compares the cumulative probability of the response being in a category less than or equal to a specific value versus being in a higher category:

\[ \log \left( \frac{P(Y \leq j)}{P(Y > j)} \right) = \alpha_j + \beta X \]

  • \(Y\) is the ordinal response variable.
  • \(j\) is the category.
  • \(\alpha_j\) are the intercepts for the \(j\)-th category.
  • \(\beta\) is the coefficient for the predictor variable \(X\).

The relationship between predictors and the log-odds of being in a lower versus higher category is constant across all categories. The coefficients \(\beta\) indicate the effect of the predictors on the log-odds of the outcome being at or below a certain category versus above it.
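
Using the cumulative-logit formula above with made-up numbers, the thresholds translate into category probabilities by applying the inverse logit and then differencing successive cumulative probabilities (note that R's clm parameterizes the model as \(\theta_j - \beta x\), so coefficient signs flip relative to this sketch):

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

def proportional_odds_probs(alphas, beta, x):
    """P(Y <= j) from log(P(Y <= j) / P(Y > j)) = alphas[j] + beta * x,
    then category probabilities by successive differences."""
    cum = [expit(a + beta * x) for a in alphas] + [1.0]  # P(Y <= J) = 1
    return [c - prev for c, prev in zip(cum, [0.0] + cum[:-1])]

probs = proportional_odds_probs(alphas=[-1.0, 1.0], beta=0.5, x=1.0)
```

Because \(\beta\) is shared across all thresholds, changing \(x\) shifts every cumulative logit by the same amount; this is the proportional odds property.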

2 Setting Up the Environment

Load the necessary libraries to ensure the R environment is equipped with the tools and functions required for efficient and effective analysis.

library(ordinal)    # to perform ordinal model
library(tidyverse)  # data wrangling
library(haven)      # to read statistical data
library(broom)      # to tidy results
library(gtsummary)  # publication-ready summary tables
library(VGAM)       # vector generalized linear models
library(brant)      # Brant test of the proportional odds assumption

3 Practical 1

The wine dataset used in this exercise illustrates how to apply cumulative link models (Proportional Odds Models) to ordinal data. It contains 72 observations and 6 variables.

  1. response: A numeric variable representing the response value.
  2. rating: An ordinal factor variable with levels indicating the bitterness rating of the wine (from 1 to 5, with 1 being the least bitter and 5 being the most bitter).
  3. temp: A factor variable indicating the temperature at which the wine was served (e.g., cold, warm).
  4. contact: A factor variable indicating whether the wine had contact with air (e.g., yes, no).
  5. bottle: A numeric variable representing different bottle codes.
  6. judge: A factor variable representing different judges who rated the wine.

3.1 Data Reading and Exploration

First, load the dataset and convert any labeled variables to factors. Then, perform exploratory data analysis to understand the data distribution and relationships between variables.

# Reading the data
data.o1 <- read_dta("Datasets/wine.dta")

# Briefly examine the data
glimpse(data.o1)
Rows: 72
Columns: 6
$ response <dbl> 36, 48, 47, 67, 77, 60, 83, 90, 17, 22, 14, 50, 30, 51, 90, 7…
$ rating   <dbl+lbl> 2, 3, 3, 4, 4, 4, 5, 5, 1, 2, 1, 3, 2, 3, 5, 4, 2, 3, 3, …
$ temp     <dbl+lbl> 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, …
$ contact  <dbl+lbl> 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, 2, 1, 1, 2, …
$ bottle   <dbl+lbl> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, …
$ judge    <dbl+lbl> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, …
# Summarize the data
summary(data.o1)
    response         rating           temp        contact        bottle    
 Min.   :12.00   Min.   :1.000   Min.   :1.0   Min.   :1.0   Min.   :1.00  
 1st Qu.:32.00   1st Qu.:2.000   1st Qu.:1.0   1st Qu.:1.0   1st Qu.:2.75  
 Median :46.00   Median :3.000   Median :1.5   Median :1.5   Median :4.50  
 Mean   :47.22   Mean   :2.917   Mean   :1.5   Mean   :1.5   Mean   :4.50  
 3rd Qu.:60.00   3rd Qu.:4.000   3rd Qu.:2.0   3rd Qu.:2.0   3rd Qu.:6.25  
 Max.   :90.00   Max.   :5.000   Max.   :2.0   Max.   :2.0   Max.   :8.00  
     judge  
 Min.   :1  
 1st Qu.:3  
 Median :5  
 Mean   :5  
 3rd Qu.:7  
 Max.   :9  
# Convert labeled variables to factor variables
data.o2 <- data.o1
data.o2 <- data.o2 %>% mutate(across(where(is.labelled), ~ as_factor(.x)))

# Check the data structure after conversion
glimpse(data.o2)
Rows: 72
Columns: 6
$ response <dbl> 36, 48, 47, 67, 77, 60, 83, 90, 17, 22, 14, 50, 30, 51, 90, 7…
$ rating   <fct> 2, 3, 3, 4, 4, 4, 5, 5, 1, 2, 1, 3, 2, 3, 5, 4, 2, 3, 3, 2, 5…
$ temp     <fct> cold, cold, cold, cold, warm, warm, warm, warm, cold, cold, c…
$ contact  <fct> no, no, yes, yes, no, no, yes, yes, no, no, yes, yes, no, no,…
$ bottle   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5…
$ judge    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3…
# Summarize the characteristics of the dataset by `rating`
data.o2 %>%
  select(response, temp, rating, contact) %>%
  tbl_summary(by = rating)
Characteristic   1, N = 5¹     2, N = 22¹    3, N = 26¹    4, N = 12¹    5, N = 7¹
response         14 (13, 17)   31 (27, 35)   48 (45, 51)   67 (62, 72)   84 (82, 89)
temp
    cold         5 (100%)      16 (73%)      13 (50%)      2 (17%)       0 (0%)
    warm         0 (0%)        6 (27%)       13 (50%)      10 (83%)      7 (100%)
contact          1 (20%)       8 (36%)       13 (50%)      9 (75%)       5 (71%)
¹ Median (IQR); n (%)

3.2 Estimation

The objective is to estimate the cumulative link model (proportional odds model) using the clm function from the ordinal package. This involves estimating regression coefficients, standard errors, and p-values. The cumulative link model (logit link) is specified as:

\[ \text{logit}\left(P(Y_i \leq j)\right) = \theta_j - \beta_1(\text{temp}) - \beta_2(\text{contact}) \]

Note

\(\theta_j\) are the threshold parameters, and \(\beta_1\) and \(\beta_2\) are the regression coefficients for temp and contact, respectively.

# Fit the cumulative link model
mod.o1 <- clm(rating ~ temp + contact, data = data.o2)
summary(mod.o1)
formula: rating ~ temp + contact
data:    data.o2

 link  threshold nobs logLik AIC    niter max.grad cond.H 
 logit flexible  72   -86.49 184.98 6(0)  4.02e-12 2.7e+01

Coefficients:
           Estimate Std. Error z value Pr(>|z|)    
tempwarm     2.5031     0.5287   4.735 2.19e-06 ***
contactyes   1.5278     0.4766   3.205  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.3444     0.5171  -2.600
2|3   1.2508     0.4379   2.857
3|4   3.4669     0.5978   5.800
4|5   5.0064     0.7309   6.850

Summary of Output

  1. Coefficients:

    • tempwarm: The estimated coefficient for tempwarm is 2.5031, meaning that having a warm temperature increases the log odds of being in a higher rating category by 2.5031, compared to the baseline category (cold temperature). The effect is statistically significant at the 0.001 level (p-value = 2.19e-06).

    • contactyes: The estimated coefficient for contactyes is 1.5278, meaning that having contact increases the log odds of being in a higher rating category by 1.5278, compared to the baseline category (no contact). The effect is statistically significant at the 0.01 level (p-value = 0.00135).

  2. Threshold Coefficients

    • 1|2: The threshold coefficient of -1.3444 separates the first category from the second. It is the log odds of being in category 1 versus category 2 or higher when both predictors are at their reference levels (cold temperature, no contact).

    • 2|3: The threshold coefficient of 1.2508 separates the second category from the third. It is the log odds of being in category 2 or lower versus category 3 or higher at the reference levels.

    • 3|4: The threshold coefficient of 3.4669 separates the third category from the fourth. It is the log odds of being in category 3 or lower versus category 4 or higher at the reference levels.

    • 4|5: The threshold coefficient of 5.0064 separates the fourth category from the fifth. It is the log odds of being in category 4 or lower versus category 5 at the reference levels.

  3. Interpretation

    • tempwarm: The estimate is 2.5031 with a p-value of 2.19e-06, indicating a significant positive association between warmer temperature and higher bitterness ratings.

    • contactyes: The estimate is 1.5278 with a p-value of 0.00135, indicating a significant positive association between contact with air and higher bitterness ratings.

# Tidy summary of the model
tidy(mod.o1)
# A tibble: 6 × 6
  term       estimate std.error statistic  p.value coef.type
  <chr>         <dbl>     <dbl>     <dbl>    <dbl> <chr>    
1 1|2           -1.34     0.517     -2.60 9.33e- 3 intercept
2 2|3            1.25     0.438      2.86 4.28e- 3 intercept
3 3|4            3.47     0.598      5.80 6.64e- 9 intercept
4 4|5            5.01     0.731      6.85 7.41e-12 intercept
5 tempwarm       2.50     0.529      4.73 2.19e- 6 location 
6 contactyes     1.53     0.477      3.21 1.35e- 3 location 

Interpretation

  1. 1|2: The threshold separating category 1 from category 2 has a log-odds estimate of -1.34, which is statistically significant (p-value < 0.01).

  2. 2|3: The threshold separating category 2 from category 3 has a log-odds estimate of 1.25, which is statistically significant (p-value < 0.01).

  3. 3|4: The threshold separating category 3 from category 4 has a log-odds estimate of 3.47, which is statistically significant (p-value < 0.001).

  4. 4|5: The threshold separating category 4 from category 5 has a log-odds estimate of 5.01, which is statistically significant (p-value < 0.001).

  5. tempwarm: The log-odds of the response variable being in a higher category increase by 2.50 when the temperature is warm compared to cold. This effect is statistically significant (p-value < 0.001).

  6. contactyes: The log-odds of the response variable being in a higher category increase by 1.53 when there is contact compared to no contact. This effect is statistically significant (p-value < 0.01).

# Tidy summary with exponentiated coefficients and confidence intervals
tidy(mod.o1, exponentiate = TRUE, conf.int = TRUE)
# A tibble: 6 × 8
  term       estimate std.error statistic  p.value conf.low conf.high coef.type
  <chr>         <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl> <chr>    
1 1|2           0.261     0.517     -2.60 9.33e- 3    NA         NA   intercept
2 2|3           3.49      0.438      2.86 4.28e- 3    NA         NA   intercept
3 3|4          32.0       0.598      5.80 6.64e- 9    NA         NA   intercept
4 4|5         149.        0.731      6.85 7.41e-12    NA         NA   intercept
5 tempwarm     12.2       0.529      4.73 2.19e- 6     4.53      36.4 location 
6 contactyes    4.61      0.477      3.21 1.35e- 3     1.85      12.1 location 

Interpretation

  1. 1|2: For the reference group (cold, no contact), the odds of being in category 1 versus category 2 or higher are 0.261, a statistically significant odds below 1.

  2. 2|3: For the reference group, the odds of being in category 2 or lower versus category 3 or higher are 3.49, which is statistically significant.

  3. 3|4: For the reference group, the odds of being in category 3 or lower versus category 4 or higher are 32.0, which is statistically significant.

  4. 4|5: For the reference group, the odds of being in category 4 or lower versus category 5 are 149, which is statistically significant.

  5. tempwarm: The odds of the response variable being in a higher category are 12.2 times higher when the temperature is warm compared to cold. This effect is statistically significant, and the confidence interval (4.53 to 36.4) indicates the range within which the true odds ratio is likely to fall with 95% confidence.

  6. contactyes: The odds of the response variable being in a higher category are 4.61 times higher when there is contact compared to no contact. This effect is statistically significant, and the confidence interval (1.85 to 12.1) indicates the range within which the true odds ratio is likely to fall with 95% confidence.
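
As a quick arithmetic check of the exponentiated output, the odds ratios are simply exp() of the log-odds coefficients; a Wald-type 95% interval can be formed on the log-odds scale and then exponentiated (a sketch only: tidy() on a clm fit reports profile-likelihood intervals, so the Wald numbers below differ slightly from those shown above):

```python
import math

# Coefficient and standard error for tempwarm from the clm summary
b, se = 2.5031, 0.5287

odds_ratio = math.exp(b)             # odds ratio, about 12.2
wald_low = math.exp(b - 1.96 * se)   # Wald 95% CI, formed on the log-odds scale
wald_high = math.exp(b + 1.96 * se)
```

The Wald interval here (roughly 4.3 to 34.4) is close to, but not identical to, the profile-likelihood interval (4.53 to 36.4) in the tidy output.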

3.3 Inferences

The inferences from the fitted cumulative link model are performed by:

  1. Checking Wald-based p-values
  2. Calculating confidence intervals
  3. Comparing models

3.3.1 Wald-Based P-Values

Wald-based p-values test whether each parameter (regression coefficient) is significantly different from zero. These were already included in the summary and tidy outputs, showing which predictors are significant.

3.3.2 Calculating Confidence Intervals

Confidence intervals provide a range of values within which the true parameter value is likely to fall. The confidence intervals were displayed in the previous tidy output with conf.int = TRUE.

3.3.3 Model Comparison using Likelihood Ratio Test

Let’s compare two nested models using the likelihood ratio test:

  1. Model mod.o1b: A simpler model with only temp as the predictor.
  2. Model mod.o1: A more complex model including both temp and contact as predictors.
# Fit the simpler model
mod.o1b <- clm(rating ~ temp, data = data.o2)

# Perform the likelihood ratio test to compare models
anova(mod.o1b, mod.o1, test = "Chisq")
'test' argument ignored in anova.clm
Likelihood ratio tests of cumulative link models:
 
        formula:                link: threshold:
mod.o1b rating ~ temp           logit flexible  
mod.o1  rating ~ temp + contact logit flexible  

        no.par    AIC  logLik LR.stat df Pr(>Chisq)    
mod.o1b      5 194.03 -92.013                          
mod.o1       6 184.98 -86.492  11.043  1  0.0008902 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Summary of Output

  • The p-value of 0.0008902 is highly significant, indicating that adding contact to the model significantly improves the fit compared to the simpler model with only temp.

  • Therefore, model mod.o1 (which includes both temp and contact) is preferred over model mod.o1b.
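
The likelihood ratio statistic in this output can be reproduced by hand from the two log-likelihoods; for one degree of freedom the chi-square tail probability has a closed form via the complementary error function (an illustrative check, not part of the original analysis):

```python
import math

# Log-likelihoods reported by clm for the nested models
ll_temp_only = -92.013   # rating ~ temp
ll_full = -86.492        # rating ~ temp + contact

lr_stat = 2 * (ll_full - ll_temp_only)        # 11.042 on df = 1
# Chi-square survival function with df = 1: P(X > x) = erfc(sqrt(x / 2))
p_value = math.erfc(math.sqrt(lr_stat / 2))   # about 0.00089
```

This reproduces the LR.stat of 11.043 (up to rounding of the printed log-likelihoods) and the p-value of 0.00089 from the anova table.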

3.4 Checking Proportional Odds Assumption

The proportional odds assumption is crucial for the cumulative link model: it requires the effect of each predictor to be the same across all cumulative splits of the outcome categories. There are two methods to check this assumption.

3.4.1 Method 1: Using nominal_test Function

The ordinal::nominal_test() function tests the ordinality of each predictor. If the p-value is greater than 0.05, we fail to reject the null hypothesis that the proportional odds assumption holds for that predictor.

nominal_test(mod.o1)
Tests of nominal effects

formula: rating ~ temp + contact
        Df  logLik    AIC    LRT Pr(>Chi)
<none>     -86.492 184.98                
temp     3 -84.904 187.81 3.1750   0.3654
contact  3 -86.209 190.42 0.5667   0.9040

Interpretation:

  • For both temp and contact, the p-values are greater than 0.05 (0.3654 and 0.9040, respectively).

  • This indicates that the assumption of proportional odds cannot be rejected for these variables, suggesting that the proportional odds assumption holds.
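
The LRT line for contact can likewise be recomputed from the log-likelihoods; with three degrees of freedom (odd df) the chi-square tail probability also has a closed form (an illustrative check only):

```python
import math

ll_base = -86.492     # rating ~ temp + contact (proportional odds for both)
ll_nominal = -86.209  # contact allowed nominal (non-proportional) effects

lrt = 2 * (ll_nominal - ll_base)   # about 0.566 (0.5667 before rounding)
# Chi-square survival function for df = 3:
# P(X > x) = erfc(sqrt(x / 2)) + sqrt(2 * x / pi) * exp(-x / 2)
p = math.erfc(math.sqrt(lrt / 2)) + math.sqrt(2 * lrt / math.pi) * math.exp(-lrt / 2)
```

This recovers the p-value of about 0.904 reported for contact in the nominal_test output.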

3.4.2 Method 2: Likelihood Ratio Test

This method involves comparing two models:

  1. Model mod.o1

    • This model includes the predictors temp and contact and assumes that the proportional odds assumption holds for all predictors.
  2. Model mod.o1.nominal:

    • This model treats the predictor contact as having nominal effects, meaning it does not assume proportional odds for this predictor.

    • Instead, it allows the effect of contact to vary across different thresholds of the ordinal response variable.

# Fit the nominal model
mod.o1.nominal <- clm(rating ~ temp, nominal = ~contact, data = data.o2)
Note

nominal = ~contact specifies that contact should be treated as having non-proportional effects.

# Compare the two models using `anova` function
anova(mod.o1, mod.o1.nominal)
Likelihood ratio tests of cumulative link models:
 
               formula:                nominal: link: threshold:
mod.o1         rating ~ temp + contact ~1       logit flexible  
mod.o1.nominal rating ~ temp           ~contact logit flexible  

               no.par    AIC  logLik LR.stat df Pr(>Chisq)
mod.o1              6 184.98 -86.492                      
mod.o1.nominal      9 190.42 -86.209  0.5667  3      0.904

Interpretation:

  • The p-value is not significant at the 5% level, indicating that the more complex model (mod.o1.nominal) does not provide a significantly better fit than the simpler model (mod.o1).

  • The proportional odds assumption holds for the predictor contact. Therefore, the simpler model (mod.o1), which assumes proportional odds for contact, is adequate.

3.5 Prediction

3.5.1 Method 1: Predicting Probabilities for Existing Data

This method focuses on predicting the probabilities (fitted probabilities) that each observation in the existing dataset falls into each response category.

# Remove the outcome variable (`rating`)
new.data1 <- data.o2 %>% select(-rating)
Note

The outcome variable (rating) is removed from the dataset because it is the response variable we want to predict.

# Predict the probabilities for each category
prob.mod.o1 <- predict(mod.o1, newdata = new.data1)
head(prob.mod.o1$fit)
           1         2         3          4          5
1 0.20679013 0.5706497 0.1922909 0.02361882 0.00665041
2 0.20679013 0.5706497 0.1922909 0.02361882 0.00665041
3 0.05354601 0.3776461 0.4430599 0.09582084 0.02992711
4 0.05354601 0.3776461 0.4430599 0.09582084 0.02992711
5 0.02088771 0.2014157 0.5015755 0.20049402 0.07562701
6 0.02088771 0.2014157 0.5015755 0.20049402 0.07562701
tail(prob.mod.o1$fit)
             1          2         3          4          5
67 0.053546010 0.37764614 0.4430599 0.09582084 0.02992711
68 0.053546010 0.37764614 0.4430599 0.09582084 0.02992711
69 0.020887709 0.20141572 0.5015755 0.20049402 0.07562701
70 0.020887709 0.20141572 0.5015755 0.20049402 0.07562701
71 0.004608274 0.05380128 0.3042099 0.36359581 0.27378469
72 0.004608274 0.05380128 0.3042099 0.36359581 0.27378469

  • The rows correspond to different observations in the dataset.

  • The columns represent the predicted probabilities for each response category (1 to 5).

class.o1 <- predict(mod.o1, type = "class")
head(class.o1)
$fit
 [1] 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3
[39] 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4 2 2 3 3 3 3 4 4
Levels: 1 2 3 4 5

  • The predict function with type = 'class' is used to predict the most likely category for each observation.

  • This output shows the predicted most likely category for each observation, represented by the numbers 1 to 5.

3.5.2 Method 2: Predicting Probabilities for New Data

This method involves predicting the probabilities that new data falls into each response category. This is useful for understanding model behavior under different conditions or for making predictions for hypothetical scenarios. Remember the categories of the outcome are:

levels(data.o2$rating)
[1] "1" "2" "3" "4" "5"

The linear predictors and predicted probabilities will be computed for the following combinations of the predictors:

  1. temp = cold, contact = no
  2. temp = warm, contact = no
  3. temp = cold, contact = yes
  4. temp = warm, contact = yes
# Create a new dataset that includes all combinations of the levels of the predictors
new.data2 <- expand.grid(temp = levels(data.o2$temp), contact = levels(data.o2$contact))
new.data2
  temp contact
1 cold      no
2 warm      no
3 cold     yes
4 warm     yes

  • This creates a new data frame with all combinations of temp and contact.
# Get linear predictors for the new data
lp.new.data2 <- predict(mod.o1, newdata = new.data2, type = "linear.predictor")

# Get probabilities for the new data
prob.new.data2 <- predict(mod.o1, newdata = new.data2, type = "prob")

# View the results
lp.new.data2
$eta1
          1          2          3         4         5
1 -1.344383  1.2508088  3.4668869 5.0064042 100000.00
2 -3.847485 -1.2522932  0.9637849 2.5033022  99997.50
3 -2.872181 -0.2769889  1.9390893 3.4786065  99998.47
4 -5.375283 -2.7800909 -0.5640127 0.9755045  99995.97

$eta2
          1         2          3          4         5
1 -100000.0 -1.344383  1.2508088  3.4668869 5.0064042
2 -100002.5 -3.847485 -1.2522932  0.9637849 2.5033022
3 -100001.5 -2.872181 -0.2769889  1.9390893 3.4786065
4 -100004.0 -5.375283 -2.7800909 -0.5640127 0.9755045
prob.new.data2
$fit
            1          2         3          4          5
1 0.206790132 0.57064970 0.1922909 0.02361882 0.00665041
2 0.020887709 0.20141572 0.5015755 0.20049402 0.07562701
3 0.053546010 0.37764614 0.4430599 0.09582084 0.02992711
4 0.004608274 0.05380128 0.3042099 0.36359581 0.27378469

  • $eta1 and $eta2: Linear predictors for each category.

  • $fit: Predicted probabilities for each category for the new data.

# Combine the new data with the predicted probabilities for a comprehensive view
cbind(new.data2, prob.new.data2)
  temp contact       fit.1      fit.2     fit.3      fit.4      fit.5
1 cold      no 0.206790132 0.57064970 0.1922909 0.02361882 0.00665041
2 warm      no 0.020887709 0.20141572 0.5015755 0.20049402 0.07562701
3 cold     yes 0.053546010 0.37764614 0.4430599 0.09582084 0.02992711
4 warm     yes 0.004608274 0.05380128 0.3042099 0.36359581 0.27378469

3.5.3 Method 3: Using polr Function from MASS Package

This method involves using the polr function from the MASS package to fit a proportional odds model and then predict probabilities for each category of the ordinal response variable.

# Fit the model
mod_polr <- MASS::polr(rating ~ temp + contact, data = data.o2)
summary(mod_polr)

Re-fitting to get Hessian
Call:
MASS::polr(formula = rating ~ temp + contact, data = data.o2)

Coefficients:
           Value Std. Error t value
tempwarm   2.503     0.5287   4.735
contactyes 1.528     0.4766   3.205

Intercepts:
    Value   Std. Error t value
1|2 -1.3444  0.5171    -2.5998
2|3  1.2508  0.4379     2.8565
3|4  3.4669  0.5978     5.7998
4|5  5.0064  0.7309     6.8496

Residual Deviance: 172.9838 
AIC: 184.9838 

  • The model is fitted using the polr function from the MASS package. The output provides the estimated coefficients for the predictors and the thresholds (intercepts) for the ordinal response variable.

  • tempwarm: The coefficient for warm temperature, indicating a significant positive association with higher ratings.

  • contactyes: The coefficient for contact with air, indicating a significant positive association with higher ratings.

# Predict the probabilities for each category
prob_polr <- predict(mod_polr, type = "probs")
head(prob_polr)
          1         2         3          4           5
1 0.2067917 0.5706466 0.1922920 0.02361917 0.006650532
2 0.2067917 0.5706466 0.1922920 0.02361917 0.006650532
3 0.0535471 0.3776458 0.4430586 0.09582112 0.029927296
4 0.0535471 0.3776458 0.4430586 0.09582112 0.029927296
5 0.0208885 0.2014185 0.5015746 0.20049218 0.075626257
6 0.0208885 0.2014185 0.5015746 0.20049218 0.075626257
tail(prob_polr)
             1          2         3          4          5
67 0.053547101 0.37764585 0.4430586 0.09582112 0.02992730
68 0.053547101 0.37764585 0.4430586 0.09582112 0.02992730
69 0.020888502 0.20141847 0.5015746 0.20049218 0.07562626
70 0.020888502 0.20141847 0.5015746 0.20049218 0.07562626
71 0.004608507 0.05380284 0.3042139 0.36359456 0.27378016
72 0.004608507 0.05380284 0.3042139 0.36359456 0.27378016

  • The predict function with type = 'probs' is used to obtain the predicted probabilities for each response category for each observation in the dataset.

  • The rows correspond to different observations in the dataset.

  • The columns represent the predicted probabilities for each response category (1 to 5).

3.5.4 Manual Calculation of Predictions

Manually calculate the probabilities for an ordinal outcome using the coefficients from a cumulative link model mod.o1.

3.5.4.1 Model Summary of mod.o1

Review the model summary to get the coefficients for the predictors and the threshold values needed for calculations.

# Model summary
summary(mod.o1)
formula: rating ~ temp + contact
data:    data.o2

 link  threshold nobs logLik AIC    niter max.grad cond.H 
 logit flexible  72   -86.49 184.98 6(0)  4.02e-12 2.7e+01

Coefficients:
           Estimate Std. Error z value Pr(>|z|)    
tempwarm     2.5031     0.5287   4.735 2.19e-06 ***
contactyes   1.5278     0.4766   3.205  0.00135 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Threshold coefficients:
    Estimate Std. Error z value
1|2  -1.3444     0.5171  -2.600
2|3   1.2508     0.4379   2.857
3|4   3.4669     0.5978   5.800
4|5   5.0064     0.7309   6.850

3.5.4.2 New Data for Prediction

The new data points for which predictions are needed are defined using the previously generated new data from Prediction: Method 2.

new.data2
  temp contact
1 cold      no
2 warm      no
3 cold     yes
4 warm     yes

3.5.4.3 Obtain Linear Predictors

Use the predict function with type = "linear.predictor" to get the linear predictors for the new data points, represented by eta1 and eta2 in the output. Refer to Prediction: Method 2.

lp.new.data2
$eta1
          1          2          3         4         5
1 -1.344383  1.2508088  3.4668869 5.0064042 100000.00
2 -3.847485 -1.2522932  0.9637849 2.5033022  99997.50
3 -2.872181 -0.2769889  1.9390893 3.4786065  99998.47
4 -5.375283 -2.7800909 -0.5640127 0.9755045  99995.97

$eta2
          1         2          3          4         5
1 -100000.0 -1.344383  1.2508088  3.4668869 5.0064042
2 -100002.5 -3.847485 -1.2522932  0.9637849 2.5033022
3 -100001.5 -2.872181 -0.2769889  1.9390893 3.4786065
4 -100004.0 -5.375283 -2.7800909 -0.5640127 0.9755045

3.5.4.4 Extract Coefficients

Extract the coefficients from the model that include the estimates for the thresholds and the predictor variables:

  • Thresholds: 1|2, 2|3, 3|4, 4|5
  • Predictors: tempwarm, contactyes
coef.mod.o1 <- coef(mod.o1)
coef.mod.o1
       1|2        2|3        3|4        4|5   tempwarm contactyes 
 -1.344383   1.250809   3.466887   5.006404   2.503102   1.527798 

3.5.4.5 Calculate Linear Predictors for Each Observation

For each new observation, calculate the linear predictor (bx) using the coefficients and the new data values. Using the fourth observation as an example:

lp.mod.o1.bx4 <- coef.mod.o1["tempwarm"] * 1 + coef.mod.o1["contactyes"] * 1
lp.mod.o1.bx4
tempwarm 
  4.0309 
Note

This calculates the linear predictor for the fourth observation, where temp is warm (coded as 1) and contact is yes (coded as 1).

3.5.4.6 Calculate Logits

Using the linear predictors (bx) and the thresholds, calculate the logits using the formula:

\[ \text{logit}(P(Y \leq j)) = \beta_{0j} - (\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p) \]

  • \(\beta_{0j}\): Threshold coefficient for category \(j\).
  • \(\beta_1, \beta_2, \ldots, \beta_p\): Coefficients for the predictor variables.
  • \(X_1, X_2, \ldots, X_p\): Values of the predictor variables.
  • Logit Calculation: The linear predictor (\(\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_p X_p\)) is subtracted from the threshold coefficient to get the logit.

Given the threshold coefficients from the model summary:

  • \(\beta_{01|2} = -1.3444\)
  • \(\beta_{02|3} = 1.2508\)
  • \(\beta_{03|4} = 3.4669\)
  • \(\beta_{04|5} = 5.0064\)
logit1 <- coef.mod.o1[1] - lp.mod.o1.bx4
logit2 <- coef.mod.o1[2] - lp.mod.o1.bx4
logit3 <- coef.mod.o1[3] - lp.mod.o1.bx4
logit4 <- coef.mod.o1[4] - lp.mod.o1.bx4
logit1
      1|2 
-5.375283 
logit2
      2|3 
-2.780091 
logit3
       3|4 
-0.5640127 
logit4
      4|5 
0.9755045 

3.5.4.7 Convert Logits to Probabilities

The probabilities represent the cumulative probabilities for each threshold. The logistic function is used to convert logits (log-odds) into probabilities. The formula for the logistic function:

\[ \text{logit}^{-1}(x) = \frac{1}{1 + e^{-x}} \]

Note

\(x\) is the logit value.

pLeq1 <- 1 / (1 + exp(-logit1))
pLeq2 <- 1 / (1 + exp(-logit2))
pLeq3 <- 1 / (1 + exp(-logit3))
pLeq4 <- 1 / (1 + exp(-logit4))

3.5.4.8 Calculate Category Probabilities

To find the probability of each specific category, calculate the differences between the cumulative probabilities. This will give the probability of the response falling into each category.

pMat <- cbind(
  p1 = pLeq1,
  p2 = pLeq2 - pLeq1,
  p3 = pLeq3 - pLeq2,
  p4 = pLeq4 - pLeq3,
  p5 = 1 - pLeq4
)
pMat
             p1         p2        p3        p4        p5
1|2 0.004608274 0.05380128 0.3042099 0.3635958 0.2737847
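
The same arithmetic can be double-checked outside R. This sketch reproduces the manual calculation for the fourth observation (temp = warm, contact = yes) from the coefficients printed by coef(mod.o1):

```python
import math

def expit(z):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-z))

# Estimates from coef(mod.o1)
thresholds = [-1.344383, 1.250809, 3.466887, 5.006404]  # 1|2, 2|3, 3|4, 4|5
bx = 2.503102 + 1.527798                                # tempwarm + contactyes

# clm parameterization: logit(P(Y <= j)) = theta_j - bx
cum = [expit(t - bx) for t in thresholds] + [1.0]       # cumulative probabilities
probs = [c - prev for c, prev in zip(cum, [0.0] + cum[:-1])]
# probs is approximately [0.0046, 0.0538, 0.3042, 0.3636, 0.2738]
```

The differenced cumulative probabilities match the pMat row computed manually above.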

3.5.4.9 Confirm Predictions with the Model

Finally, confirm the manual calculations by comparing them with the predictions from the predict function using type = "prob". Refer to Method 2.

prob.new.data2
$fit
            1          2         3          4          5
1 0.206790132 0.57064970 0.1922909 0.02361882 0.00665041
2 0.020887709 0.20141572 0.5015755 0.20049402 0.07562701
3 0.053546010 0.37764614 0.4430599 0.09582084 0.02992711
4 0.004608274 0.05380128 0.3042099 0.36359581 0.27378469

4 Practical 2

The lowbwt.dta dataset used in this exercise is sourced from the Applied Logistic Regression book and is used here to fit logistic regression models for ordinal outcomes.

Variables:

  • id: Identification number for each observation.
  • lbw: Indicator variable for low birth weight.
  • age: Age of the mother in years.
  • lwt: Weight of the mother at the last menstrual period in pounds.
  • race: Race of the mother.
  • smoke: Smoking status during pregnancy.
  • ptl: Number of previous premature labors.
  • hyper: Presence of hypertension.
  • urirr: Presence of uterine irritability.
  • pvft: Physician visits during the first trimester.
  • weight: Birth weight in grams.
  • agecat: Age category of the mother.
  • wcat: Weight category of the mother.
  • anyptl: Indicator variable for any previous premature labor.
  • newpvft: New variable for physician visits during the first trimester.

4.1 Data Reading and Exploration

# Data reading
data.o3 <- read_dta("Datasets/lowbwt.dta")

# Data summary
summary(data.o3)
       id             lbw              age             lwt       
 Min.   :  4.0   Min.   :0.0000   Min.   :14.00   Min.   : 80.0  
 1st Qu.: 68.0   1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0  
 Median :123.0   Median :0.0000   Median :23.00   Median :121.0  
 Mean   :121.1   Mean   :0.3122   Mean   :23.24   Mean   :129.8  
 3rd Qu.:176.0   3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0  
 Max.   :226.0   Max.   :1.0000   Max.   :45.00   Max.   :250.0  
      race           smoke             ptl             hyper        
 Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :1.847   Mean   :0.3915   Mean   :0.1958   Mean   :0.06349  
 3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :3.000   Max.   :1.0000   Max.   :3.0000   Max.   :1.00000  
     urirr             pvft            weight         agecat     
 Min.   :0.0000   Min.   :0.0000   Min.   : 709   Min.   :1.000  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414   1st Qu.:1.000  
 Median :0.0000   Median :0.0000   Median :2977   Median :2.000  
 Mean   :0.1481   Mean   :0.7937   Mean   :2945   Mean   :2.238  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3475   3rd Qu.:3.000  
 Max.   :1.0000   Max.   :6.0000   Max.   :4990   Max.   :4.000  
      wcat           anyptl          newpvft      
 Min.   :1.000   Min.   :0.0000   Min.   :0.0000  
 1st Qu.:2.000   1st Qu.:0.0000   1st Qu.:0.0000  
 Median :3.000   Median :0.0000   Median :0.0000  
 Mean   :2.857   Mean   :0.1587   Mean   :0.6931  
 3rd Qu.:4.000   3rd Qu.:0.0000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :1.0000   Max.   :2.0000  
# Detailed view of the dataset
glimpse(data.o3)
Rows: 189
Columns: 15
$ id      <dbl> 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 1…
$ lbw     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ age     <dbl> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18…
$ lwt     <dbl> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95,…
$ race    <dbl> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3,…
$ smoke   <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,…
$ ptl     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ hyper   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ urirr   <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,…
$ pvft    <dbl> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2,…
$ weight  <dbl> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 27…
$ agecat  <dbl+lbl> 1, 4, 2, 2, 1, 2, 2, 1, 3, 3, 1, 1, 2, 4, 1, 1, 1, 3, 2, 3…
$ wcat    <dbl+lbl> 5, 5, 1, 2, 2, 3, 2, 1, 3, 2, 1, 4, 1, 2, 1, 1, 1, 2, 2, 2…
$ anyptl  <dbl+lbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0…
$ newpvft <dbl+lbl> 0, 2, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 2, 0, 1…

4.1.1 Converting Variables to Factor Variables

Since the dataset comes with labelled variables, convert them to factor variables using the mutate and across functions from the dplyr package.

# Copy into new dataset and keep original dataset
data.o4 <- data.o3

# Convert to factor variables
data.o4 <- data.o4 %>%
  mutate(across(where(is.labelled), ~ as_factor(.x)))

summary(data.o4)
       id             lbw              age             lwt       
 Min.   :  4.0   Min.   :0.0000   Min.   :14.00   Min.   : 80.0  
 1st Qu.: 68.0   1st Qu.:0.0000   1st Qu.:19.00   1st Qu.:110.0  
 Median :123.0   Median :0.0000   Median :23.00   Median :121.0  
 Mean   :121.1   Mean   :0.3122   Mean   :23.24   Mean   :129.8  
 3rd Qu.:176.0   3rd Qu.:1.0000   3rd Qu.:26.00   3rd Qu.:140.0  
 Max.   :226.0   Max.   :1.0000   Max.   :45.00   Max.   :250.0  
      race           smoke             ptl             hyper        
 Min.   :1.000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median :1.000   Median :0.0000   Median :0.0000   Median :0.00000  
 Mean   :1.847   Mean   :0.3915   Mean   :0.1958   Mean   :0.06349  
 3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.00000  
 Max.   :3.000   Max.   :1.0000   Max.   :3.0000   Max.   :1.00000  
     urirr             pvft            weight       agecat        wcat   
 Min.   :0.0000   Min.   :0.0000   Min.   : 709   <20  :51   <105   :37  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2414   20-24:69   106-120:55  
 Median :0.0000   Median :0.0000   Median :2977   25-29:42   121-130:31  
 Mean   :0.1481   Mean   :0.7937   Mean   :2945   30+  :27   131-150:30  
 3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:3475              151+   :36  
 Max.   :1.0000   Max.   :6.0000   Max.   :4990                          
 anyptl   newpvft 
 0 :159   0 :100  
 1+: 30   1 : 47  
          2+: 42  
                  
                  
                  
glimpse(data.o4)
Rows: 189
Columns: 15
$ id      <dbl> 85, 86, 87, 88, 89, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 1…
$ lbw     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
$ age     <dbl> 19, 33, 20, 21, 18, 21, 22, 17, 29, 26, 19, 19, 22, 30, 18, 18…
$ lwt     <dbl> 182, 155, 105, 108, 107, 124, 118, 103, 123, 113, 95, 150, 95,…
$ race    <dbl> 2, 3, 1, 1, 1, 3, 1, 3, 1, 1, 3, 3, 3, 3, 1, 1, 2, 1, 3, 1, 3,…
$ smoke   <dbl> 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0,…
$ ptl     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,…
$ hyper   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,…
$ urirr   <dbl> 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0,…
$ pvft    <dbl> 0, 3, 1, 2, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2, 0, 0, 0, 3, 0, 1, 2,…
$ weight  <dbl> 2523, 2551, 2557, 2594, 2600, 2622, 2637, 2637, 2663, 2665, 27…
$ agecat  <fct> <20, 30+, 20-24, 20-24, <20, 20-24, 20-24, <20, 25-29, 25-29, …
$ wcat    <fct> 151+, 151+, <105, 106-120, 106-120, 121-130, 106-120, <105, 12…
$ anyptl  <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1+, 0, 0, 0, 0, 0, 0, 0…
$ newpvft <fct> 0, 2+, 1, 2+, 0, 0, 1, 1, 1, 0, 0, 1, 0, 2+, 0, 0, 0, 2+, 0, 1…

4.1.2 Creating a Categorical Outcome Variable

Create a categorical outcome variable using the cut function. The variable weight is categorized into a four-category variable bwt4.

# Create categorical outcome variable
data.o4 <- data.o4 %>%
  mutate(bwt4 = cut(weight,
    breaks = c(708, 2500, 2999, 3500, max(weight)),
    labels = c("<=2500", "2501-3000", "3001-3500", ">3500")
  ))

# Summarize dataset by the newly created variable `bwt4`
data.o4 %>%
  select(bwt4, weight) %>%
  group_by(bwt4) %>%
  summarize_at(vars(weight), c(min = min, max = max))
# A tibble: 4 × 3
  bwt4        min   max
  <fct>     <dbl> <dbl>
1 <=2500      709  2495
2 2501-3000  2523  2992
3 3001-3500  3005  3487
4 >3500      3544  4990
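As a cross-check of the binning logic, the same right-closed intervals can be reproduced with a short sketch (Python here; the break points and labels come straight from the cut call above, and cut defaults to right-closed intervals).

```python
from bisect import bisect_left

# Reproduce the right-closed binning of cut() above: a weight equal to a
# break point falls into the lower bin (cut's default, right = TRUE).
breaks = [2500, 2999, 3500]  # upper bounds of the first three bins
labels = ["<=2500", "2501-3000", "3001-3500", ">3500"]

def bwt4(weight):
    # bisect_left sends values equal to a break into the lower bin
    return labels[bisect_left(breaks, weight)]

# Boundary checks matching the min/max values in the tibble above
print(bwt4(2495), bwt4(2523), bwt4(3005), bwt4(3544))
```

The printed categories match the min/max rows of the summary tibble, confirming that no observation sits on the wrong side of a break.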

4.1.3 Creating an Ordinal Variable

Create an ordinal variable and define the levels of the variable.

# Create and define the levels of ordinal variable
lev <- c(">3500", "3001-3500", "2501-3000", "<=2500")
data.o4 <- data.o4 %>%
  mutate(bwt4a = fct_relevel(bwt4, lev)) %>%
  mutate(bwt4a = ordered(bwt4a, levels = lev))

# Check the structure of the new variable
str(data.o4$bwt4a)
 Ord.factor w/ 4 levels ">3500"<"3001-3500"<..: 3 3 3 3 3 3 3 3 3 3 ...
# Check the levels of `bwt4`
levels(data.o4$bwt4)
[1] "<=2500"    "2501-3000" "3001-3500" ">3500"    
# Check the levels of `bwt4a`
levels(data.o4$bwt4a)
[1] ">3500"     "3001-3500" "2501-3000" "<=2500"   

4.2 Baseline Logit Model

The Baseline Logit Model, also known as the Multinomial Logistic Regression Model, is used to handle situations where the outcome variable is nominal with more than two categories. This model extends the binary logistic regression model to multiple categories by modeling the log-odds of each category relative to a baseline category.

This exercise demonstrates how to replicate the unconstrained baseline logit model, which requires an outcome variable without ordering. In the dataset data.o4, a new variable bwt4b without ordering is created from bwt4 by re-leveling the factors according to the specified levels (lev).

# Create new variable and re-leveling according to the specified levels
data.o4 <- data.o4 %>% mutate(bwt4b = fct_relevel(bwt4, lev))

# Fit the model
bl.mod <- vglm(formula = bwt4 ~ smoke, family = multinomial, data = data.o4)
summary(bl.mod)

Call:
vglm(formula = bwt4 ~ smoke, family = multinomial, data = data.o4)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)   
(Intercept):1  -0.1881     0.2511  -0.749  0.45392   
(Intercept):2  -0.4643     0.2721  -1.707  0.08791 . 
(Intercept):3  -0.1881     0.2511  -0.749  0.45392   
smoke:1         1.1914     0.4328   2.753  0.00591 **
smoke:2         0.8390     0.4769   1.759  0.07853 . 
smoke:3         0.6234     0.4613   1.351  0.17658   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: log(mu[,1]/mu[,4]), log(mu[,2]/mu[,4]), 
log(mu[,3]/mu[,4])

Residual deviance: 510.9718 on 561 degrees of freedom

Log-likelihood: -255.4859 on 561 degrees of freedom

Number of Fisher scoring iterations: 4 

No Hauck-Donner effect found in any of the estimates


Reference group is level  4  of the response

  • smoke:1, with an estimate of 1.1914, implies a positive effect of smoking on the log-odds of being in the first category (<=2500) relative to the reference category (>3500).
  • smoke:2 and smoke:3 also have positive estimates, suggesting that smoking likewise increases the odds of being in these categories relative to the reference.
# Calculate the exponentiated coefficients (RRR)
exp(coef(bl.mod))
(Intercept):1 (Intercept):2 (Intercept):3       smoke:1       smoke:2 
    0.8285714     0.6285714     0.8285714     3.2915361     2.3140496 
      smoke:3 
    1.8652038 

  • For smoke:1, the RRR is 3.2915, meaning the relative odds of being in the lowest birth weight category (<=2500) rather than the reference category (>3500) are about 3.29 times higher for smokers.
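The RRRs can be verified by exponentiating the reported coefficients directly; a quick Python check using the rounded smoke estimates from the summary above:

```python
import math

# Relative risk ratios are the exponentiated multinomial-logit
# coefficients; the smoke estimates come from the bl.mod summary above.
coefs = {"smoke:1": 1.1914, "smoke:2": 0.8390, "smoke:3": 0.6234}
rrr = {name: math.exp(b) for name, b in coefs.items()}

for name, value in rrr.items():
    # ~3.2917, ~2.3141, ~1.8653 (tiny differences come from rounding
    # the coefficients to four decimals before exponentiating)
    print(f"{name}: {value:.4f}")
```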

4.3 Continuation Ratio Model

The Continuation Ratio Model is another approach to modeling ordinal response data, particularly useful when the response categories have a natural order and when the sequential nature of the response process makes clinical or logical sense. It differs from other ordinal logistic models in that it models the log-odds of falling in a particular category given that the observation has not fallen in any of the previous categories, so the categories are compared sequentially. In this case, the weight categories are compared in the following manner:

  1. Compare bwt >3500 vs. bwt 3001-3500
  2. Compare bwt >3500 and 3001-3500 (combined) vs. bwt 2501-3000
  3. Compare bwt >3500, 3001-3500, and 2501-3000 (combined) vs. bwt <=2500
# Check the distribution of weight categories
table(data.o4$bwt4)

   <=2500 2501-3000 3001-3500     >3500 
       59        38        46        46 
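Each successive comparison enlarges the analysis dataset by one category. A small sketch (Python, using the category counts from the table above) gives the sample sizes that the three sequential fits should have, which can be checked against the residual degrees of freedom in each model summary.

```python
# Sample sizes for the three sequential continuation-ratio comparisons,
# built from the category counts in the table above.
counts = {">3500": 46, "3001-3500": 46, "2501-3000": 38, "<=2500": 59}
order = [">3500", "3001-3500", "2501-3000", "<=2500"]

# The k-th model uses the k+1 highest weight categories.
sizes, running = [], 0
for cat in order:
    running += counts[cat]
    sizes.append(running)

print(sizes[1:])  # [92, 130, 189]
```

These match the three model fits: 92, 130, and 189 observations (null deviances on 91, 129, and 188 degrees of freedom, respectively).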

4.3.1 First Model

# Filter the dataset to include only the weight categories `>3500` and `3001-3500`
data.o4a <- data.o4 %>% 
  filter(bwt4a == '>3500' | bwt4a == '3001-3500')

# Check the distribution of weight categories
table(data.o4a$bwt4a)

    >3500 3001-3500 2501-3000    <=2500 
       46        46         0         0 
# Fit the first model that predicts `bwt4a` based on `smoke`
cr.mod1 <- glm(bwt4a ~ smoke, 
               family = binomial(link ='logit'), 
               data = data.o4a)

summary(cr.mod1)

Call:
glm(formula = bwt4a ~ smoke, family = binomial(link = "logit"), 
    data = data.o4a)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.1881     0.2511  -0.749    0.454
smoke         0.6234     0.4613   1.351    0.177

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 127.54  on 91  degrees of freedom
Residual deviance: 125.68  on 90  degrees of freedom
AIC: 129.68

Number of Fisher Scoring iterations: 4

4.3.2 Second Model

# Filter the dataset to include the weight categories `>3500`, `3001-3500`, and `2501-3000`
data.o4b <- data.o4 %>% 
  filter(bwt4 == '>3500' | bwt4 == '3001-3500' | bwt4 == '2501-3000')

# Recode `bwt4a` into a binary variable `bwt4b`: 0 for the two higher weight categories, 1 for `2501-3000`
data.o4b <- data.o4b %>% 
  mutate(bwt4b = ifelse(bwt4a == '>3500', 0, ifelse(bwt4a == '3001-3500', 0, 1)))

# Check the distribution of weight categories
table(data.o4b$bwt4b)

 0  1 
92 38 
# Fit the second model with the recoded `bwt4b` as the dependent variable and `smoke` as the independent variable
cr.mod2 <- glm(bwt4b ~ smoke, 
               family = binomial(link ='logit'), 
               data = data.o4b)

summary(cr.mod2)

Call:
glm(formula = bwt4b ~ smoke, family = binomial(link = "logit"), 
    data = data.o4b)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0678     0.2471  -4.321 1.56e-05 ***
smoke         0.5082     0.3991   1.273    0.203    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 157.09  on 129  degrees of freedom
Residual deviance: 155.49  on 128  degrees of freedom
AIC: 159.49

Number of Fisher Scoring iterations: 4

4.3.3 Third Model

# Recode `bwt4a` into `bwt4c`, collapsing the three higher weight categories into a single group `gp0`
data.o4c <- data.o4 %>%
  mutate(bwt4c = fct_recode(bwt4a, 
                            gp0 = '>3500',
                            gp0 = '3001-3500',
                            gp0 = '2501-3000'))

# Check the distribution of weight categories
table(data.o4c$bwt4c)

   gp0 <=2500 
   130     59 
# Fit the third model to compare the recoded groups with `smoke` as the predictor
cr.mod3 <- glm(bwt4c ~ smoke, 
               family = binomial(link ='logit'), 
               data = data.o4c)

summary(cr.mod3)

Call:
glm(formula = bwt4c ~ smoke, family = binomial(link = "logit"), 
    data = data.o4c)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.0871     0.2147  -5.062 4.14e-07 ***
smoke         0.7041     0.3196   2.203   0.0276 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.67  on 188  degrees of freedom
Residual deviance: 229.80  on 187  degrees of freedom
AIC: 233.8

Number of Fisher Scoring iterations: 4

4.3.4 Conclusion

The three smoke estimates from the continuation-ratio models are quite similar: 0.62, 0.51, and 0.70 on the log-odds scale. Exponentiating them gives odds ratios between about 1.7 and 2.0, indicating that the odds of a birth falling in the next lower weight category, relative to the higher weight categories combined, are roughly 1.8 times higher among women who smoked during pregnancy than among those who did not. In summary, smoking during pregnancy consistently increases the odds of a lower birth weight category by approximately 1.8 times across the sequential comparisons.
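The odds ratios summarized above follow directly from exponentiating the three smoke estimates; a quick Python check using the coefficients from the cr.mod1 to cr.mod3 summaries:

```python
import math

# Odds ratios for smoking from the three continuation-ratio fits above
# (cr.mod1, cr.mod2, cr.mod3).
estimates = [0.6234, 0.5082, 0.7041]
odds_ratios = [math.exp(b) for b in estimates]

print([round(o, 2) for o in odds_ratios])  # [1.87, 1.66, 2.02]
```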

 

A dedicated work by Dr Muhammad Saufi